Data Visualization

In this activity we will work through the basics of data visualization.

Activity 1

Visualizing a stellar spectrum.

The first step is to get the spectrum file. Download it from here: https://drive.google.com/open?id=12MCvaypCE7jyvRy6ohN93cYvSoXVRsjE

Next, you need to read the file. To do so, use the loadtxt function from numpy. For help with this function see: https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
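
As a minimal sketch of this step, the snippet below reads a small two-column text table with numpy.loadtxt. The column layout (wavelength, then flux), the sample values, and the in-memory stand-in for the file are assumptions made for illustration; the downloaded spectrum may be laid out differently, so inspect it before reading it.

```python
import io

import numpy as np

# Hypothetical stand-in for the downloaded file: two whitespace-separated
# columns, assumed here to be wavelength and flux.
sample = io.StringIO(
    "# wavelength flux\n"
    "4000.0 1.2e-13\n"
    "4001.0 1.1e-13\n"
    "4002.0 1.3e-13\n"
)

# unpack=True transposes the result so each column becomes its own array;
# lines starting with '#' are treated as comments and skipped by default.
wavelength, flux = np.loadtxt(sample, unpack=True)

print(wavelength.shape, flux.shape)
```

With a real file, the path to the downloaded spectrum would take the place of the StringIO object.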

With the data loaded, you can now visualize it. To do so, use the plot function: https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html

What do you observe in this plot? What can be said about the data? Look at the scale of the axes. What does it tell you?

How can you improve your visualization? Try changing the axis scale; in matplotlib this is set on the axes (for example with yscale) rather than through an argument of plot. Can you see more information?
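
One possible way to compare scales is sketched below, using synthetic data rather than the real spectrum. The arrays x and y are made up for illustration; the point is only that the same curve can look very different on a linear and on a logarithmic axis.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in spanning several orders of magnitude
# (not the real data from the activity).
x = np.linspace(1.0, 100.0, 200)
y = 1e4 * np.exp(-x / 10.0) + 1.0

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

ax_lin.plot(x, y)
ax_lin.set_title("linear y axis")

ax_log.plot(x, y)
ax_log.set_yscale("log")  # same data, logarithmic y axis
ax_log.set_title("logarithmic y axis")

# Label both panels so the plots can be read on their own.
for ax in (ax_lin, ax_log):
    ax.set_xlabel("x")
    ax.set_ylabel("y")

fig.tight_layout()
```

On the linear panel the small values are squeezed against the bottom of the frame, while the logarithmic panel spreads them out.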

Can it be improved even further?

Activity 2

In this activity we will explore a data set using the pandas package: https://pandas.pydata.org/

The pandas DataFrame:


A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, the data is laid out as a table, and each column may hold a different type. A DataFrame has three main components: the data, the rows, and the columns.
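
The three components can be inspected directly. The sketch below builds a small DataFrame by hand; the column names, row labels, and values are made up for illustration only.

```python
import pandas as pd

# Small illustrative table; the values are made up for this example.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c"],     # strings
        "length": [5.1, 4.9, 4.7],   # floats
        "n": [1, 2, 3],              # integers
    },
    index=["row1", "row2", "row3"],
)

print(df.values)   # the data
print(df.index)    # the row labels
print(df.columns)  # the column labels
print(df.dtypes)   # each column keeps its own type (heterogeneous)

# The frame is mutable: columns can be added after creation.
df["double_length"] = 2 * df["length"]
```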

In this activity we will use data frames and the functionality of pandas to analyze the Iris data set: https://archive.ics.uci.edu/ml/datasets/iris

The file to be read as a data frame is here: https://drive.google.com/open?id=1S6RHV3l-xPSHdsBc1sOu60_nwPUQucEw

Let's read the file:



In [1]:

    
# importing pandas package
import pandas as pd
 
# making data frame from csv file
data = pd.read_csv("iris.csv")



In [2]:

    
data.head()









    Out[2]:







  
    
      
   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa



In [3]:

    
data.dtypes









    Out[3]:





sepal.length    float64
sepal.width     float64
petal.length    float64
petal.width     float64
variety          object
dtype: object



In [4]:

    
data.columns









    Out[4]:





Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')



In [5]:

    
# summary statistics of the numeric columns
data.describe()









    Out[5]:







  
    
      
       sepal.length  sepal.width  petal.length  petal.width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000



In [6]:

    
data['sepal.length']









    Out[6]:





0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
5      5.4
6      4.6
7      5.0
8      4.4
9      4.9
10     5.4
11     4.8
12     4.8
13     4.3
14     5.8
15     5.7
16     5.4
17     5.1
18     5.7
19     5.1
20     5.4
21     5.1
22     4.6
23     5.1
24     4.8
25     5.0
26     5.0
27     5.2
28     5.2
29     4.7
      ... 
120    6.9
121    5.6
122    7.7
123    6.3
124    6.7
125    7.2
126    6.2
127    6.1
128    6.4
129    7.2
130    7.4
131    7.9
132    6.4
133    6.3
134    6.1
135    7.7
136    6.3
137    6.4
138    6.0
139    6.9
140    6.7
141    6.9
142    5.8
143    6.8
144    6.7
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal.length, Length: 150, dtype: float64



In [7]:

    
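# average only the numeric columns; recent pandas versions raise an error
# if the text column 'variety' is included
data.mean(numeric_only=True)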









    Out[7]:





sepal.length    5.843333
sepal.width     3.057333
petal.length    3.758000
petal.width     1.199333
dtype: float64



In [8]:

    
data.plot()









    Out[8]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f446cb9a908>



In [9]:

    
data.hist()









    Out[9]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f446c7aa1d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c7532e8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f446c778860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c722dd8>]],
      dtype=object)



In [10]:

    
from pandas.plotting import scatter_matrix

scatter_matrix(data, alpha=0.2, diagonal='kde', figsize=(10, 10))









    Out[10]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f446c657c18>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c5e9940>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c590eb8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c5c0470>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f446c5679e8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c50ff60>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c53f518>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c4e8ac8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f446c4e8b00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c4c15c0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c46bb38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c41c0f0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f446c442668>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c3ebbe0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c39d198>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f446c3c4710>]],
      dtype=object)

Now let's use a library designed specifically for statistical data visualization, Seaborn:

https://seaborn.pydata.org/



In [11]:

    
# Seaborn: statistical data visualization library for Python
import seaborn as sns
import matplotlib.pyplot as plt

# scatter plots for every pair of numeric columns
sns.pairplot(data)









    Out[11]:





<seaborn.axisgrid.PairGrid at 0x7f446c68dc88>



In [12]:

    
sns.pairplot(data, hue="variety")









    Out[12]:





<seaborn.axisgrid.PairGrid at 0x7f4468383eb8>



In [13]:

    
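# correlate only the numeric columns; the text column 'variety' cannot be
# correlated and makes recent pandas versions raise an error
sns.heatmap(data.select_dtypes("number").corr(), annot=True)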









    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f4463958b00>



In [14]:

    
fig = sns.FacetGrid(data,hue='variety')
fig.map(plt.scatter,'sepal.length','sepal.width').add_legend()









    Out[14]:





<seaborn.axisgrid.FacetGrid at 0x7f445f68c6d8>

We can make a plot that shows the properties of some of the data columns. A very useful kind of plot is the boxplot, or box-and-whisker diagram: https://pt.wikipedia.org/wiki/Diagrama_de_caixa



In [15]:

    
plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.boxplot(x='variety',y='sepal.length',data=data)
plt.subplot(2,2,2)
sns.boxplot(x='variety',y='sepal.width',data=data)









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f445f5c93c8>
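
The quantities a box plot draws can also be computed numerically. The sketch below does so on a tiny made-up table that only borrows the column names of the Iris file; the groups "A" and "B" and all values are assumptions for illustration, not the real data.

```python
import pandas as pd

# Tiny made-up table using the same column names as the Iris file.
toy = pd.DataFrame(
    {
        "variety": ["A"] * 5 + ["B"] * 5,
        "sepal.length": [4.6, 4.7, 4.9, 5.0, 5.1, 6.0, 6.3, 6.5, 6.7, 7.0],
    }
)

# Quartiles per group: the box spans Q1 to Q3 and the line inside it
# marks the median.
quartiles = (
    toy.groupby("variety")["sepal.length"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# Interquartile range; with seaborn's default settings the whiskers reach
# the most extreme points within 1.5 * IQR of the box.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

Applied to the real data frame, the same groupby call would replace toy with data.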

Activity 3

Let's investigate data for the Pleiades open cluster obtained with the Gaia satellite. The goal is to identify the cluster properties that allow it to be separated from field stars. Proper motions and parallaxes are known to be good indicators.

The file is available at: https://drive.google.com/file/d/1bU5hqfKWzwjvJLSLFKLhCUn1EdSkUpNL/view?usp=sharing

Hints:

Blank spaces in the data can cause problems. Use .astype('float64') to convert the values to float.
It can be useful to use a kind of plot that shows the density of points for two variables. See here: https://seaborn.pydata.org/examples/joint_kde.html
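
Both hints can be sketched on made-up data. The column names pmra and pmdec and all values below are assumptions for illustration; check the actual column names in the file. For the density view, this sketch uses a plain matplotlib 2D histogram instead of the seaborn joint KDE linked above, as a simpler way to see where points pile up.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values stored as text with stray spaces, as they might appear
# after reading the catalogue. The column names are assumptions.
raw = pd.DataFrame(
    {
        "pmra": [" 20.1", "19.8 ", " 1.2 ", "20.3", " -3.4"],
        "pmdec": ["-45.2 ", " -45.9", "0.7", " -44.8 ", "2.1 "],
    }
)

# Strip the surrounding whitespace, then convert every column to float64.
clean = raw.apply(lambda col: col.str.strip()).astype("float64")

# 2D histogram: the colour of each cell encodes how many points fall in
# it, so groups of stars with similar proper motion stand out.
fig, ax = plt.subplots()
counts, xedges, yedges, image = ax.hist2d(clean["pmra"], clean["pmdec"], bins=10)
fig.colorbar(image, ax=ax, label="number of stars")
ax.set_xlabel("pmra")
ax.set_ylabel("pmdec")
```

With the real catalogue, a compact overdensity in such a plot is a natural candidate for the cluster, which can then be checked against the parallaxes.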



In [ ]:
